Data Science and Machine Learning for Geospatial Data

Exploratory Data Analysis of the HURDAT2 Data for Hurricane Prediction using Tableau and Python

Table of contents

Introduction

Data Extraction

EDA with Tableau

EDA with Python

Introduction

The Atlantic hurricane dataset (known as Atlantic HURDAT2) is a comma-delimited text file with six-hourly information on the location, maximum winds, central pressure, and (beginning in 2004) size of all known tropical and subtropical cyclones. As part of our assignment, we first perform exploratory data analysis (EDA) of the dataset using Tableau and Python. After that, we process the data, build a machine learning model, and test and evaluate it. We divided our work into two notebooks: this notebook contains the EDA of the dataset, and the Hurrican_Prediction_model notebook contains the workflow for model building, testing, and evaluation.

Importing the Data into the Workspace

An initial exploration of the dataset shows that it is made up of 49105 records and 22 columns, corresponding to value fields ranging from the ID of the tropical storm to the directional wind speeds. We will proceed to get this data into a form and shape suitable for developing our models.
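As a rough illustration of this step, the sketch below reads comma-delimited rows and reports the number of records and columns. It uses only the standard library rather than pandas, and the column names and the two sample rows are assumptions made for the example, not the full 22-column file.

```python
import csv
import io


def load_records(handle):
    """Read comma-delimited rows into a list of dicts, stripping padding."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [
        {key.strip(): (value or "").strip()
         for key, value in row.items() if key is not None}
        for row in reader
    ]


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "ID,Name,Date,Time,Status,Latitude,Longitude,Maximum Wind\n"
    "AL011851,UNNAMED,18510625,0,HU,28.0N,94.8W,80\n"
    "AL011851,UNNAMED,18510625,600,HU,28.0N,95.4W,80\n"
)
records = load_records(sample)
n_rows = len(records)
n_cols = len(records[0]) if records else 0
print(n_rows, n_cols)
```

Passing an opened file instead of the `io.StringIO` sample would give the record and column counts for the real data.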

Data Cleaning

In this step, we apply a series of data preparation techniques to get the data into a form and shape suitable for developing a robust machine learning model.

The first transformation concerns the location attributes: latitude and longitude. In the unprocessed data, these values are recorded with a hemisphere suffix (N or S for latitude, E or W for longitude) that shows which side of the equator and the prime meridian they fall on. Although this notation is useful, we need to convert these fields into numerical values that correspond to the actual coordinates of the points.
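A minimal sketch of this conversion is shown below. The function name `parse_coordinate` is hypothetical, and the sign convention (south and west negative) is an assumption based on the usual decimal-degree convention.

```python
def parse_coordinate(text):
    """Convert a string such as '28.0N' or '94.8W' to signed decimal degrees."""
    text = text.strip()
    value, hemisphere = float(text[:-1]), text[-1].upper()
    if hemisphere not in "NSEW":
        raise ValueError(f"unexpected hemisphere suffix in {text!r}")
    # Assumed convention: south and west are negative, north and east positive.
    return -value if hemisphere in "SW" else value


latitude = parse_coordinate("28.0N")
longitude = parse_coordinate("94.8W")
```

Applied to every row, this turns the two text columns into plain floats that can be used for plotting and modelling.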

Removing Null Values or Missing Data

Another step is to handle missing data. As with most meteorological records, missing values are usually recorded as -999, indicating that either no observation was taken or the value is missing. We need to either discard these inputs or replace them with a constant value. In this case, we will first explore the data to see how many missing values it contains.
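One way to measure the degree of missing data is to compute, per column, the percentage of rows that hold the -999 sentinel. The sketch below does this with plain dictionaries. The helper name, the column names, and the four sample rows are illustrative assumptions, not values from the dataset.

```python
def missing_percentages(rows, sentinel=-999):
    """Return {column: percent of rows whose value equals the sentinel}."""
    if not rows:
        return {}
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column in counts:
            if row.get(column) == sentinel:
                counts[column] += 1
    return {column: 100.0 * n / len(rows) for column, n in counts.items()}


# Hypothetical rows; -999 marks a missing observation.
rows = [
    {"Maximum Wind": 80, "Minimum Pressure": -999, "Low Wind NE": -999},
    {"Maximum Wind": 80, "Minimum Pressure": -999, "Low Wind NE": -999},
    {"Maximum Wind": 70, "Minimum Pressure": 990, "Low Wind NE": -999},
    {"Maximum Wind": 60, "Minimum Pressure": 995, "Low Wind NE": 30},
]
shares = missing_percentages(rows)

# Columns with no missing values at all are the ones worth keeping.
complete_columns = [column for column, pct in shares.items() if pct == 0.0]
```

With a pandas DataFrame, the same idea could be expressed as a comparison against the sentinel followed by a column-wise mean.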

Based on the computation above, the share of missing values in many of the columns holding meteorological variables reaches as much as 87.94%, which is far too high. Working with this kind of data can introduce significant bias into our model and lead to many false positives or wrong predictions. Hence, we drop those columns and keep only the columns that have no missing values.


Geographic Coordinates

To make it easy to plot the locations of tropical storm points on a map, the pandas DataFrame is converted to a GeoDataFrame using the GeoPandas package, and point geometries are derived from the latitude and longitude.
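The notebook relies on GeoPandas for this conversion. As a dependency-free sketch of the same idea, deriving a point geometry from each row's latitude and longitude, the rows can be turned into GeoJSON point features. This is a swapped-in technique for illustration, and the function name and column names are assumptions. Note that GeoJSON stores coordinates in longitude, latitude order.

```python
def to_point_features(rows):
    """Build a GeoJSON FeatureCollection with one Point per row."""
    features = []
    for row in rows:
        properties = {
            key: value for key, value in row.items()
            if key not in ("Latitude", "Longitude")
        }
        features.append({
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [row["Longitude"], row["Latitude"]],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical row with coordinates already converted to signed floats.
collection = to_point_features(
    [{"ID": "AL011851", "Latitude": 28.0, "Longitude": -94.8}]
)
```

A feature list of this shape can typically be loaded into GeoPandas with `GeoDataFrame.from_features`, which yields the geometry column used for mapping.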

Exploratory Data Analysis

In this phase, we will explore the data by creating a number of plots and charts to show the distribution of values and to understand the data in more depth.

Geographical Distribution of Hurricanes

Top 5 Hurricanes by Frequency

Apparently, unnamed storms are the most common in the dataset, accounting for the tropical storms that occurred before the storm/hurricane naming system was introduced.
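A ranking like this can be sketched by counting how many distinct storms share each name. The snippet assumes each record carries an `ID` and a `Name` field (assumed column names) and uses made-up records. Counting raw records per name instead would be an equally plausible reading of "frequency".

```python
from collections import Counter


def top_names(records, n=5):
    """Return the n most common storm names, counting each storm ID once."""
    storms = {(record["ID"], record["Name"]) for record in records}
    return Counter(name for _, name in storms).most_common(n)


# Hypothetical records: three unnamed storms, two ALPHA storms, one BETA storm.
records = [
    {"ID": "A1", "Name": "UNNAMED"},
    {"ID": "A1", "Name": "UNNAMED"},  # second observation of the same storm
    {"ID": "A2", "Name": "UNNAMED"},
    {"ID": "A3", "Name": "UNNAMED"},
    {"ID": "B1", "Name": "ALPHA"},
    {"ID": "B2", "Name": "ALPHA"},
    {"ID": "C1", "Name": "BETA"},
]
top_two = top_names(records, n=2)
```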

Probability Distribution Function of Frequency

Category Wise Frequency Distribution of Cyclones

Landfall Storms

Storms that Made Landfall


Four statuses stand out, together accounting for ~98% of all measurements. Tropical Storm (TS), with wind speeds of 34-63 knots, is the most frequent status, followed by Tropical Depression (TD) with wind speeds below 34 knots, Hurricane (HU) with wind speeds of at least 64 knots, and Low (LO), a system that is neither a tropical cyclone, a subtropical cyclone, nor an extratropical cyclone (of any intensity).
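The wind thresholds for the three tropical classes can be written as a small lookup. This is only a sketch: the function name is hypothetical, and it covers TD, TS, and HU only, since statuses such as LO depend on the structure of the system rather than on wind speed alone.

```python
def status_from_wind(max_wind_knots):
    """Map maximum sustained wind (knots) to TD, TS, or HU."""
    if max_wind_knots < 34:
        return "TD"  # tropical depression: below 34 knots
    if max_wind_knots < 64:
        return "TS"  # tropical storm: 34-63 knots
    return "HU"      # hurricane: 64 knots or more


labels = [status_from_wind(wind) for wind in (30, 34, 63, 64)]
```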

Cyclicality:

On average, nearly 10 storms occur each season, with 2005 being the most active season at 28 storms. The data shows a detectable increase in the frequency and severity of hurricanes over the past few decades. Is climate change the reason for the increased severity and the clustering of hurricanes we witnessed in 2017? Plotting 10-year and 25-year moving averages shows a multi-decadal oscillation.
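The seasonal counts and their moving averages could be computed along the following lines. The sketch assumes each record has an `ID` and a `Date` in YYYYMMDD form (assumed column names), uses a trailing window, and runs on made-up records. The notebook may well use a different windowing choice.

```python
from collections import defaultdict


def storms_per_season(records):
    """Count distinct storm IDs per year, taking the year from a YYYYMMDD date."""
    ids_by_year = defaultdict(set)
    for record in records:
        year = int(str(record["Date"])[:4])
        # Caveat: a storm that crosses New Year is counted in both years.
        ids_by_year[year].add(record["ID"])
    return {year: len(ids) for year, ids in sorted(ids_by_year.items())}


def moving_average(values, window):
    """Trailing moving average; the first window - 1 positions are None."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window:i + 1]) / window)
    return averages


# Hypothetical records covering two seasons.
sample = [
    {"ID": "AL012004", "Date": "20040801"},
    {"ID": "AL012004", "Date": "20040802"},
    {"ID": "AL012005", "Date": "20050608"},
    {"ID": "AL022005", "Date": "20050703"},
]
per_season = storms_per_season(sample)
smoothed = moving_average(list(per_season.values()), window=2)
```

On the full record, calling `moving_average` with windows of 10 and 25 would give the two smoothed series mentioned above.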

Best Storm Track

Plot tracks of Hurricanes

EDA with Tableau

Tableau is a powerful tool for EDA. The visualization below maps the trajectories of hurricanes for a given year. The size of each dot is proportional to the wind speed at that location, and the color distinguishes the different hurricanes.

Hurricane_path.PNG

Wind Speed and Pressure Correlation graph

Wind_Speed.PNG

Peak Season Graph

The graph shows the distribution of hurricanes per month; September produced the strongest hurricanes across all years.

Most_hurricane.PNG